Author: Adepoju David
Objective: The aim of this project is to analyze interactions within the GDG AI and Data Track group chat and, based on that analysis, provide recommendations on how to boost engagement within the group. The analysis focuses on message volume, timing, content, and member activity levels.
The data was extracted from the GDG AI and Data Track WhatsApp group as a text file containing timestamps, sender names, and message content. The text file is transformed into a CSV file, which is available on Kaggle.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import re
import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt', quiet=True)  # tokenizer models required by word_tokenize
date_pattern = r'^\d+/\d+/\d+,'  # Pattern to detect a new date (start of a valid message)

with open(r'C:\Users\obalabi adepoju\python\GDG.txt', encoding='utf-8') as file, \
     open('cleandata.txt', 'w', encoding='utf-8') as newfile:
    raw = []
    for line in file:
        line = line.strip()
        match = re.search(r'(\d+/\d+/\d+),[\s\u202f]*(\d{1,2}:\d{2})[\s\u202f]*([AP]M)\s*-\s*([^:]*):\s*(.*)', line)
        if match:
            # New message: date, time, AM/PM, sender, text (commas stripped for CSV safety)
            raw.append([match.group(1), match.group(2), match.group(3),
                        match.group(4), match.group(5).replace(',', ' ')])
        elif raw and not re.match(date_pattern, line):
            # Continuation of a multi-line message: append to the previous entry
            raw[-1][-1] += ' ' + line.replace(',', ' ')
    for entry in raw:
        newfile.write(','.join(entry) + '\n')
df = pd.read_csv('cleandata.txt',names=['Date','Time','Hour','Username','Message'])
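Before moving on, it helps to see what the regex above actually captures. A minimal sketch on a made-up line in the WhatsApp export format (not taken from the real chat):

```python
import re

pattern = r'(\d+/\d+/\d+),[\s\u202f]*(\d{1,2}:\d{2})[\s\u202f]*([AP]M)\s*-\s*([^:]*):\s*(.*)'

# A hypothetical export line: date, time, AM/PM marker, sender, message.
line = '9/26/23, 10:43 AM - Anji: Good morning, Fam'
m = re.search(pattern, line)
print(m.groups())
# ('9/26/23', '10:43', 'AM', 'Anji', 'Good morning, Fam')

# A continuation line (no leading date) doesn't match, so the loop above
# appends it to the previous message instead.
print(re.search(pattern, 'second line of a multi-line message'))  # None
```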
In this phase, we examine the dataset to understand its features and the transformations it needs. We'll handle null values, fix columns that are inconsistent with the preferred standard, and create additional metrics and columns we'll use later in the analysis.
# We'll check out general information about our data, then the first few records
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2024 entries, 0 to 2023
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   Date      2024 non-null   object
 1   Time      2024 non-null   object
 2   Hour      2024 non-null   object
 3   Username  2024 non-null   object
 4   Message   2017 non-null   object
dtypes: object(5)
memory usage: 79.2+ KB
print(f"This dataset contains {df.shape[0]} rows, {df.shape[1]} columns and {df['Message'].isna().sum()} null values")
This dataset contains 2024 rows, 5 columns and 7 null values
df.head(10)
| | Date | Time | Hour | Username | Message |
|---|---|---|---|---|---|
| 0 | 9/26/23 | 10:43 | AM | +234 705 765 7787 | Good morning Fam🤭🤩 |
| 1 | 9/26/23 | 10:45 | AM | +234 705 765 7787 | Our very own medium Publication is here. If y... |
| 2 | 9/26/23 | 10:49 | AM | +234 705 765 7787 | Even if you know absolutely nothing about data... |
| 3 | 9/26/23 | 6:18 | PM | +234 808 132 0599 | The link to join the Babcock Community on Medi... |
| 4 | 9/26/23 | 6:31 | PM | Anji | Hey please kindly try again the issue has bee... |
| 5 | 9/26/23 | 6:31 | PM | +234 808 132 0599 | It works now Thanks |
| 6 | 9/27/23 | 9:17 | AM | +234 812 352 4860 | If you registered for this year thinkcyber si... |
| 7 | 9/27/23 | 12:43 | PM | +234 705 765 7787 | https://x.com/gdscbabcockdata/status/170699704... |
| 8 | 9/29/23 | 12:04 | PM | Anji | https://medium.com/gdsc-babcock-dataverse/ai-v... |
| 9 | 9/29/23 | 12:09 | PM | Anji | Hiii guysss🤩 This is an extension of the disc... |
# In the general info about our dataset, there were null values in the message column so let's fix that.
df[df.Message.isna()]
| | Date | Time | Hour | Username | Message |
|---|---|---|---|---|---|
| 17 | 10/2/23 | 11:16 | AM | +234 913 069 2875 | NaN |
| 21 | 10/3/23 | 4:58 | PM | Riri | NaN |
| 340 | 10/30/23 | 7:48 | PM | Kev | NaN |
| 429 | 11/8/23 | 11:25 | AM | +234 810 883 5171 | NaN |
| 618 | 11/22/23 | 11:18 | PM | +234 705 765 7787 | NaN |
| 1356 | 2/8/24 | 3:39 | PM | +234 704 465 4314 | NaN |
| 1898 | 11/1/24 | 8:09 | PM | nekumartins | NaN |
After cross-checking with messages in the GDG Data group chat, we've confirmed the null values are view-once media that have been opened, so we'll replace these values with "view once media". We'll also get rid of deleted messages.
df.fillna('view once media',inplace=True)
df = df[~df.Message.isin(['This message was deleted'])]
In the dataset information, notice the Date, Time and Hour columns are all objects, which is wrong. Next, we'll change the data type of the Date column and combine the Hour and Time columns into a single column representing the hour of the day a message was sent.
df['Hour'] = df.Time + ' ' + df.Hour
df['Hour'] = pd.to_datetime(df['Hour'], format='%I:%M %p').dt.strftime('%H:%M:%S')
df['Hour'] = pd.to_datetime(df['Hour'], format='%H:%M:%S').dt.hour
df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%y').dt.strftime('%d/%m/%y')
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%y')
df.drop(columns=['Time'],inplace=True)
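As a quick sanity check of the conversion above, the same result can be reached with a single parse per column; a sketch on a one-row toy frame (illustrative values, not the real data):

```python
import pandas as pd

toy = pd.DataFrame({'Date': ['9/26/23'], 'Time': ['6:18'], 'Hour': ['PM']})

# Parse '6:18 PM' once and take .dt.hour directly; 6 PM becomes hour 18.
toy['Hour'] = pd.to_datetime(toy['Time'] + ' ' + toy['Hour'], format='%I:%M %p').dt.hour
toy['Date'] = pd.to_datetime(toy['Date'], format='%m/%d/%y')
print(toy[['Date', 'Hour']].iloc[0].tolist())
```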
# This is what our data looks like after transformation
df.head()
| | Date | Hour | Username | Message |
|---|---|---|---|---|
| 0 | 2023-09-26 | 10 | +234 705 765 7787 | Good morning Fam🤭🤩 |
| 1 | 2023-09-26 | 10 | +234 705 765 7787 | Our very own medium Publication is here. If y... |
| 2 | 2023-09-26 | 10 | +234 705 765 7787 | Even if you know absolutely nothing about data... |
| 3 | 2023-09-26 | 18 | +234 808 132 0599 | The link to join the Babcock Community on Medi... |
| 4 | 2023-09-26 | 18 | Anji | Hey please kindly try again the issue has bee... |
"""" notice our username columns has some numbers which is not meant to be so we'll be replacing the numbers
with their appropriate usernames. Let's first check out the records without usernames """
data_nousername = df[df['Username'].str.match(r'^\+\d{3}\s\d{3}\s\d{3}\s\d{4}$')]['Username']
print(data_nousername.nunique())
90
data_nousername
0 +234 705 765 7787
1 +234 705 765 7787
2 +234 705 765 7787
3 +234 808 132 0599
5 +234 808 132 0599
...
2015 +234 816 974 8438
2016 +234 903 060 9267
2017 +234 816 974 8438
2018 +234 815 237 2204
2019 +234 704 497 0340
Name: Username, Length: 1617, dtype: object
These are the records without usernames, and after checking with the group chat data, we've found matching names for most numbers. Any number without a username will be replaced with "Unknown".
phone_to_username = {
'+234 705 765 7787': 'Anjola (Data Lead)',
'+234 701 230 1583': 'Tominsin',
'+234 701 707 9122': 'Michael Okpechi',
'+234 701 418 0591': 'Ghost',
'+234 701 631 1706': 'Archie Ubong',
'+234 701 938 3449': 'O2',
'+234 703 135 6257': 'Emmanuel',
'+234 703 686 1996': 'Tobi',
'+234 703 716 4268': 'CakeCombat',
'+234 703 809 8061': 'Onofiok',
'+234 704 465 4314': 'Desmond',
'+234 704 497 0340': 'D',
'+234 704 552 6681': 'Your Favourite Designer',
'+234 705 426 8521': 'Human Rice',
'+234 706 655 0353': 'Dosu',
'+234 706 939 6807': 'Adefolasayo',
'+234 707 157 9947': 'Favour',
'+234 802 261 6347': 'Graphics Showdown',
'+234 802 372 1474': 'Rin',
'+234 802 378 2189': 'Omo2',
'+234 802 583 9156': 'Nsikak',
'+234 803 498 9700': 'Danilo',
'+234 803 782 4909': 'Temmy',
'+234 805 588 2047': 'Josh',
'+234 805 734 2230': 'Smile',
'+234 805 734 3297': 'D.',
'+234 806 395 7072': 'Unknown',
'+234 807 124 965': 'Marcus',
'+234 808 005 5880': '_m e l i_',
'+234 808 132 0599': 'Unanonymous',
'+234 808 427 6257': 'exotic olive oil taster',
'+234 808 595 8442': 'sorefunmi',
'+234 808 667 3100': 'Ulenyo',
'+234 808 814 1904': 'Edidiong',
'+234 808 817 4358': 'Michael.',
'+234 809 598 7234': 'Raynan',
'+234 810 055 0405': 'Mileke',
'+234 810 424 4919': 'Mide',
'+234 810 883 5171': '.',
'+234 810 940 3334': 'Ismail',
'+234 811 373 1827': 'Dakio-Horsfall',
'+234 811 857 2658': 'Caleb',
'+234 812 004 6755': 'Big Memz',
'+234 812 157 6677': 'Abolo (Former Data Lead)',
'+234 812 352 4860': 'Jeremy',
'+234 813 067 9140': 'Oj',
'+234 813 161 1815': 'Favour',
'+234 813 415 2749': 'ifeoma',
'+234 813 651 8781': 'Guy',
'+234 813 675 7509': 'Chucks',
'+234 807 124 9657': 'Marcus',
'+234 813 869 7269': 'X_L.',
'+234 814 133 8360': '_',
'+234 814 167 6471': 'Believe....',
'+234 814 219 2662': 'Unknown',
'+234 814 634 4816': 'Kenjaku',
'+234 814 793 9252': 'zee',
'+234 815 237 2204': 'Freda',
'+234 816 447 5065': 'harunafaruk',
'+234 816 974 8438': 'Favourrr',
'+234 818 365 6414': 'Daddy D The Designer',
'+234 818 597 1928': 'SamToye',
'+234 901 430 7553': 'chibuzor',
'+234 901 744 8561': 'nuel dadson',
'+234 901 817 5771': 'luxury.dev',
'+234 902 023 3535': 'IGWE',
'+234 902 082 9000': 'Ikotun',
'+234 902 145 3973': 'Winner',
'+234 902 240 0557': ':.....:',
'+234 902 652 0981': 'Imokut',
'+234 902 895 0689': 'kejleb',
'+234 903 060 9267': 'M.Blessed',
'+234 903 727 7328': 'Stargirl',
'+234 903 925 3435': 'Living Legend',
'+234 904 016 2008': 'izzy',
'+234 904 043 7948': 'Crown Ltd',
'+234 904 419 5816': 'Edidiong',
'+234 904 923 9905': 'DD',
'+234 905 702 6031': 'luxury.dev',
'+234 905 925 1484': 'Faith Olawuyi',
'+234 905 984 4630': 'Honour (E.H.H)',
'+234 906 185 0112': 'Ola',
'+234 906 395 2111': 'Kay Boi',
'+234 906 921 2399': 'Treyo',
'+234 912 895 6416': 'Fisayomi',
'+234 913 069 2875': 'IFESINACHUCKWU',
'+234 913 167 8833': '🥷',
'+234 913 186 9375': 'Shepherd',
'+234 913 418 0175': 'Sharon',
'+234 915 261 7900': '😎😎',
'+234 915 635 0191': 'Chibu',
'+234 915 762 0814': 'Oluwasemilogo',
'+234 815 671 3736': 'Abolo (Former Data Lead)',
'+234 814 893 5061': 'M.I',
'+234 915 436 8771': 'Prospect'
}
df['Username'] = df['Username'].replace(phone_to_username)
df['Username'] = df['Username'].replace({'Anji':'Anjola (Data Lead)','Riri':'Riley (GDG Lead)'})
df[df['Username'].str.match(r'^\+\d{3}\s\d{3}\s\d{3}\s\d{4}$')]
| | Date | Hour | Username | Message |
|---|---|---|---|---|

The query returns no rows, confirming every remaining record now has a username.
# We're only analyzing messages from the year 2024, so we filter out the rest
df = df[df.Date.dt.year > 2023 ].reset_index(drop=True)
Alright, now that that's done, let's add a few columns we'll use later in the analysis: the time of day a message was sent, the message length, the weekday, the month, and whether the day falls on a weekend.
df['Time'] = df['Hour'].apply(lambda x: 'Late Night' if x < 6 else 'Morning' if x < 12
                              else 'Afternoon' if x < 18 else 'Evening')
df['WordCount'] = df['Message'].str.len()  # note: str.len() counts characters, not words
df['Weekday'] = pd.to_datetime(df['Date']).dt.day_name()
df['Month'] = pd.to_datetime(df['Date']).dt.month_name()
df['Weekend'] = df['Weekday'].apply(lambda x :'No' if x in ['Monday','Tuesday','Wednesday','Thursday','Friday'] else 'Yes')
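The time-of-day bucketing above can also be expressed with pd.cut, which is vectorized and makes the bin edges explicit; a sketch with the same labels and boundaries as the lambda:

```python
import pandas as pd

hours = pd.Series([2, 9, 14, 21])
buckets = pd.cut(hours, bins=[-1, 5, 11, 17, 23],
                 labels=['Late Night', 'Morning', 'Afternoon', 'Evening'])
print(list(buckets))  # ['Late Night', 'Morning', 'Afternoon', 'Evening']
```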
Now we want to quantify the message content and use that to create our Quality column. This is how it works:

A compute_quality_score function will be created to calculate a score from the ratio of "good words" (unique non-stopwords) to total unique words in a message, as len_good_words² / len_words. The score behaves as follows: if a message contains no good words, len_good_words is 0 and the quality score is 0; if there are no stopwords at all, the formula simplifies to approximately len_good_words, since every word is good. The quality score therefore ranges from 0 to N, where N is the number of unique non-stopwords in the message.
stopwords = {'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"}
def compute_quality_score(string: str):
    words = word_tokenize(string.lower())
    words = {word for word in words if word.isalpha()}  # unique alphabetic tokens
    len_good_words = len(words - stopwords)
    len_words = len(words) + 1e-15  # epsilon guards against division by zero
    return (len_good_words * len_good_words) / len_words

def compute_quality_ranking(string: str):
    words = word_tokenize(string.lower())
    words = {word for word in words if word.isalpha()}
    len_good_words = len(words - stopwords)
    len_words = len(words) + 1e-15
    return round((len_good_words / len_words) * 5, 2)  # good-word ratio scaled to 0-5
df['Quality'] = df.Message.map(compute_quality_score)
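To see how the score behaves, here's a minimal re-implementation on toy messages. It uses a plain str.split tokenizer instead of NLTK's word_tokenize to stay self-contained, and a tiny stand-in stopword set (named differently so it doesn't clobber the full set above), so numbers can differ slightly from the notebook where punctuation is involved:

```python
toy_stopwords = {'the', 'is', 'a', 'to', 'and', 'of'}  # tiny stand-in for the full set

def quality_score(text: str) -> float:
    words = {w for w in text.lower().split() if w.isalpha()}  # unique alphabetic tokens
    good = len(words - toy_stopwords)
    return good * good / (len(words) + 1e-15)

print(quality_score('the is a'))                # 0.0 -> all stopwords
print(quality_score('gradient descent rocks')) # ~3.0 -> all good words
print(quality_score('the model is learning'))  # ~1.0 -> 2 good words out of 4
```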
frequency = df.groupby('Username').agg(Total=('Username','count')).reset_index().sort_values('Total',ascending=False)
consistency = df.groupby(['Date','Username']).agg(w=('Username','count'))
consistency = consistency.groupby('Username').agg(active_days=('w','count')).reset_index()
quality = df.groupby('Username').agg(Words=('Message', lambda x: ' '.join(x))).reset_index()
data = frequency.merge(consistency, on='Username')
To create our activity level feature, we will use three key factors to categorize our data: frequency (a member's share of total messages), consistency (the share of days on which a member was active), and message quality.
Each of these factors will be ranked on a scale of 1 to 5, and the average of the three scores will determine the activity level category for each member.
After thoroughly examining our data, we found that the distribution of message content follows a power law. Specifically, out of 400 members, only 27% are active in the group. Among these active users, a small group contributes to over 30% of the total messages. This means that a few individuals drive most of the conversations.
In response to this, we have adjusted the ranking system for 2 of our factors to ensure that it accurately reflects the level of interaction within the group. The rankings for both factors have been modified so that not everyone falls into the same category, ensuring a more balanced and meaningful classification.
Below are the rankings for both factors.
| Contribution % | Score (1-5) | Active Days % | Score (1-5) |
|---|---|---|---|
| ≥ 5% | 5 | ≥ 20% | 5 |
| 3% – 4.99% | 4 | 15% – 19.99% | 4 |
| 1% – 2.99% | 3 | 10% – 14.99% | 3 |
| 0.5% – 0.99% | 2 | 5% – 9.99% | 2 |
| < 0.5% | 1 | < 5% | 1 |
data['Total'] = round((data['Total']/df.shape[0]) * 100,2)
data['Total'] = data['Total'].apply(
lambda x: 5 if x >= 5 else
4 if 3 <= x < 5 else
3 if 1 <= x < 3 else
2 if 0.5 <= x < 1 else
1
)
data['active_days'] = round((data['active_days']/df['Date'].nunique()) * 100,2)
data['active_days'] = data['active_days'].apply(
    lambda x: 5 if x >= 20 else
              4 if 15 <= x < 20 else
              3 if 10 <= x < 15 else
              2 if 5 <= x < 10 else
              1
)
quality['quality'] = quality.Words.map(compute_quality_ranking)
data = pd.merge(data,quality[['Username','quality']],on='Username')
data['Score'] = round((data['Total']+data['active_days']+data['quality'])/3,1)
data['Activity_level'] = data['Score'].apply(
lambda x: 'Low' if 0 <= x < 2.5 else
'Average' if 2.5 <= x <= 3.5 else
'High')
df = pd.merge(df,data[['Username','Activity_level']], on='Username', how='left',sort=False)
# our data now looks like this
df.head()
| | Date | Hour | Username | Message | Time | WordCount | Weekday | Month | Weekend | Quality | Activity_level |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2024-01-01 | 2 | Anjola (Data Lead) | Happy New Year my data family❤️❤️🥳🥳🥳 | Late Night | 36 | Monday | January | No | 3.200000 | High |
| 1 | 2024-01-01 | 2 | Edidiong | same to youu | Late Night | 12 | Monday | January | No | 0.333333 | Average |
| 2 | 2024-01-01 | 2 | exotic olive oil taster | happy new year anj❤️ | Late Night | 20 | Monday | January | No | 3.000000 | Average |
| 3 | 2024-01-01 | 10 | Chucks | Happy new year to you too😊 | Morning | 26 | Monday | January | No | 1.800000 | Low |
| 4 | 2024-01-01 | 11 | Crown Ltd | Happy New Year 🎇 | Morning | 16 | Monday | January | No | 3.000000 | Low |
print(f"In 2024, the GDG AI and Data Track amassed a total of {df.shape[0]} messages.")
In 2024, the GDG AI and Data Track amassed a total of 1275 messages.
#Let's visualize the distribution of monthly messages
fig = px.histogram(x=df['Date'],color_discrete_sequence=['blue'],nbins=12)
fig.update_traces(marker_line_color='black',marker_line_width=0.5)
fig.update_layout(xaxis_title='Month',yaxis_title='Count of Messages')
fig.show()
From the plot, we observe the total number of messages per month from the start to the end of the year. The distribution appears right-skewed: there is a higher number of messages at the start of the year, with a steady decline as the months progress. Notably, there's a spike toward the end of the year, which we'll examine later. Due to this skewness, we will use the median to calculate the average number of daily messages.
answer = df.groupby('Date').agg(Total=('Message','count')).reset_index()
print(f"On average, {round(answer['Total'].median())} messages were sent daily.")
On average, 4 messages were sent daily.
Now we're curious about how much members write, so using our message length column, we'd like to find the typical message length per member.
It is important to note this does not determine the quality of a message or how much it contributes to the conversation, but we'd still like to know just how much members say.
answer = df.groupby('Username').agg(Total=('Message','count'),Totlength=('WordCount','sum')).reset_index()
answer['lengthpermessage'] = round(answer['Totlength']/answer['Total'],2)
Q1 = answer['lengthpermessage'].quantile(0.25)
Q3 = answer.lengthpermessage.quantile(0.75)
IQR = Q3-Q1
upper_fence = Q3 + 1.5 * IQR
lower_fence = Q1 - 1.5 * IQR
data = answer[(answer.lengthpermessage > lower_fence) & (answer.lengthpermessage < upper_fence)]
fig = px.violin(data,'lengthpermessage',title = 'Length per Message Distribution',
color_discrete_sequence=['mediumseagreen'])
fig.show()
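The fence calculation above is Tukey's standard IQR rule; a sketch on toy per-member message lengths shows how an extreme value gets excluded:

```python
import pandas as pd

s = pd.Series([18, 22, 25, 30, 45, 60, 400])  # 400 is an obvious outlier
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
kept = s[(s > lower) & (s < upper)]
print(list(kept))  # [18, 22, 25, 30, 45, 60] -- 400 falls outside the fences
```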
Based on the plot above, we observe that most members average between 15 and 60 characters per message, with a median of 25, so while members don't typically write lengthy messages, the content they contribute is sufficient for engagement.
Now, we'd like to create a word cloud to get an idea of which topics members converse about the most.
h = {}
answer = ''
for i in df['Message']:
i = i.strip()
answer += i + " "
answer = answer.strip().split()
for i in answer:
if i.lower() in stopwords:
continue
h[i.lower()] = h.get(i.lower(), 0) + 1
h = {key: value for key, value in h.items() if value >= 2}
h = dict(sorted(h.items(), key=lambda item: item[1],reverse=True))
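The counting loop above can be condensed with collections.Counter, which also provides most_common for the sorting step; a sketch on two toy messages with a stand-in stopword set (named so it doesn't overwrite the full set above):

```python
from collections import Counter

demo_stopwords = {'the', 'is', 'a', 'to', 'and', 'of'}  # stand-in for the full set
messages = ['the model is learning', 'learning data and the model']

counts = Counter(
    word
    for msg in messages
    for word in msg.lower().split()
    if word not in demo_stopwords
)
# Keep words that appear at least twice, most frequent first.
top = {w: c for w, c in counts.most_common() if c >= 2}
print(top)  # {'model': 2, 'learning': 2}
```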
%%capture
%pip install wordcloud
After filtering the dictionary further to remove undesired words such as "make", "please", "use", etc., below are the 12 most frequently occurring words in the GDG Data & AI track.
from wordcloud import WordCloud
word_freq = {
"Data": 110,
"Learning": 91,
"AI": 52,
"Datacamp": 49,
"Google": 49,
"Model": 40,
"Machine": 27,
"Research": 25,
"Babcock": 23,
"ML": 22,
"Classification":20,
"Linear":20
}
wordcloud = WordCloud(width=800, height=400, background_color='white').generate_from_frequencies(word_freq)
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()
Now, we'd like to check whether the quality of a message has anything to do with its length, so we'll use a scatter plot with a fitted trendline.
fig = px.scatter(df, x='WordCount', y='Quality',trendline='ols')
fig.show()
# Our scatter plot's pretty convincing but let's still calculate the r coefficient to be sure
print(df['WordCount'].corr(df['Quality']))
0.9594462008530914
So from our scatter plot and the r coefficient, we can say that longer messages are associated with higher quality scores. Note that this doesn't mean a lengthy message equals a high-quality one, as it's entirely possible that several words in a single message add nothing to the conversation.
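For reference, the r coefficient printed above is Pearson's correlation; on toy (length, quality) pairs it can be computed by hand and checked against pandas' .corr():

```python
import pandas as pd

x = pd.Series([10, 20, 30, 40])      # hypothetical message lengths
y = pd.Series([1.0, 2.1, 2.9, 4.2])  # hypothetical quality scores

# Pearson's r: covariance divided by the product of standard deviations.
r = ((x - x.mean()) * (y - y.mean())).sum() / (
    (((x - x.mean()) ** 2).sum() * (((y - y.mean()) ** 2).sum())) ** 0.5
)
print(round(r, 4), round(x.corr(y), 4))  # the two values agree
```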
The first question we would like to answer here is: at what hours and times of day are members most active?
# We'll answer the question with subplots
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
sns.histplot(df['Hour'], color='blue',bins=24)
plt.title('Histogram of Hour')
plt.subplot(1,2,2)
sns.countplot(x=df['Time'],color='blue')
plt.title('Histogram of Time')
plt.tight_layout()
plt.show()
From the two subplots, we can observe that the majority of interactions occur between 8 PM and midnight, with members primarily active in the evenings until they likely go to bed. Additionally, there is a significant level of interaction between 10 AM and 1 PM, making the late morning and early afternoon the second busiest period for interactions.
Next, we'd also like to check whether weekdays or weekends affect the level of interaction in the group and which days have the highest number of messages.
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
sns.countplot(x=df['Weekday'], color='lightgreen')
plt.title('Histogram of Weekday')
plt.subplot(1,2,2)
sns.countplot(x=df['Weekend'],color='lightgreen')
plt.title('Histogram of Weekend')
plt.tight_layout()
plt.show()
The bar plots above show that Sunday is the day of the week with the most interactions.
I'd also like to put it out there that there might be more messages on Sundays than on Saturdays due to Babcock University's unique but backwards culture, but this is nothing more than speculation, as more detailed information would be required to state it as fact.
Tuesdays and Wednesdays also have significant levels of interaction, but contrary to what the second plot might lead you to believe, it cannot be said that there is a higher level of interaction on weekdays than on weekends, because we have not factored in the number of days of each type. We'll do that in the next cell.
answer = df.groupby('Weekday').agg(Total=('Message','count'),Weekend=('Weekend','first')).reset_index()
answer = answer.groupby('Weekend').agg(Average=('Total','mean'))
answer
| Weekend | Average |
|---|---|
| No | 169.4 |
| Yes | 214.0 |
From what we've gathered, on average, interactions are higher on weekend days (214.0 messages) than on weekdays (169.4). Despite there being more total interactions across weekdays (since there are more weekdays in a week), individual weekend days have higher engagement per day than individual weekdays.
So does this answer our question? Yes, because we now know that individual weekend days have higher interaction levels. Nonetheless, it's important to be aware that this doesn't tell us whether more people interact in the group during weekends, as a high number of messages can mean fewer active members sending more messages.
Since Sundays are the days with the most interactions, we'd also like to check which hours of the day have the highest number of messages.
df_sunday = df[df.Weekday == 'Sunday']
fig = px.histogram(df_sunday['Hour'], color_discrete_sequence=['mediumpurple'],nbins=24, title ='Histogram of Hour')
fig.update_traces(marker_line_color='black',marker_line_width=0.5)
fig.show()
Looks like most interactions on Sundays take place between 10 AM and 1 PM.
The next question we want to answer is: which days of the year saw the most messages?
df['Date'].value_counts()
Date
2024-01-28 150
2024-01-10 117
2024-01-17 68
2024-01-30 64
2024-10-06 55
...
2024-09-26 1
2024-10-01 1
2024-10-03 1
2024-05-06 1
2024-08-05 1
Name: count, Length: 111, dtype: int64
The two most notable days are the 10th and the 28th of January, amassing exactly 267 messages in total. Below is a brief summary of the topics discussed and the members who actively participated in these two conversations, either by asking questions or making contributions. Note that the following is optional reading; skip ahead unless you're curious.
Sunday, 28th of January
The major topics discussed were LLMs, AI, Google technologies integrated with AI, and other tech companies that have released AI models. It started with Samuel Abolo (Data Lead 2023/2024) answering a couple of questions that had been asked previously via an anonymous link. Other members that contributed include Tobi and Guy. The day ended with a brief discussion about a study jam held that night on linear regression; members that were part of the conversation include Abolo and Chibuzor.
Wednesday, 10th of January
The conversation started with messages about the DataCamp scholarship being offered by GDG Babcock University, and the member who contributed the most to this conversation was Anjola (Data Lead 2024/2025). The rest of the day's conversations were primarily about a fun little session where members tried to classify whether a problem could be handled by a model trained using supervised, unsupervised, or reinforcement learning. Key contributors include Abolo, M.I and Oba.
To begin, we'll first of all check out our activity level column to visualize our member categories.
member = df.groupby('Username').agg(Activity=('Activity_level','first'))
member = member['Activity'].value_counts()
fig = px.histogram(x=member.values,y=member.index,title='Histogram of Activity Level Categories')
fig.update_traces(marker_line_color='black',marker_line_width=0.5)
fig.update_layout(xaxis_title='Count of Activity Level',yaxis_title='Activity Level')
fig.show()
From the plot above, we gather that out of the 400+ members in the GDG Data Track, only about 19% have sent at least one message. Among these "active" members, over 80% fall into the low activity category, while the high activity category contains only a handful of members.
Well then, let's check out the A-listers in our data group chat! First, in terms of consistency
members = df.groupby('Username').agg(Total=('Date', 'nunique')).reset_index()
members['Percentage of Active Days'] = round((members['Total']/df['Date'].nunique())*100)
answer = members.sort_values('Percentage of Active Days',ascending=False).head(5)
answer['Username'].values
array(['Anjola (Data Lead)', 'Abolo (Former Data Lead)',
'Riley (GDG Lead)', 'chibuzor', 'exotic olive oil taster'],
dtype=object)
Above are the names of our most consistent members, ranked in descending order of consistency. Leading the list is our very own track lead Anjola, the most dedicated member of the GDG Data and AI track. A special mention goes to our 5th most consistent member, with bonus points for the unique username: Exotic Olive Oil Taster.
Now we'd love to see our top 5 members in terms of the number of messages sent to the group, along with some additional features such as the day each member was most active and the time of day they've sent the most messages.
members = df.groupby('Username').agg(Total=('Message', 'count')).reset_index()
members['Contribution'] = round((members['Total']/df['Message'].shape[0]*100))
answer = members.sort_values('Contribution',ascending=False).head(5)
Top_5 = df[df.Username.isin(answer['Username'].values)]
Top_5.groupby('Username').agg(peak_weekday=('Weekday',pd.Series.mode),more_active_on_weekends=('Weekend',pd.Series.mode),
peak_hour=('Hour',pd.Series.mode),peak_day=('Date',pd.Series.mode))
| peak_weekday | more_active_on_weekends | peak_hour | peak_day | |
|---|---|---|---|---|
| Username | ||||
| Abolo (Former Data Lead) | Sunday | No | 22 | 2024-01-28 |
| Anjola (Data Lead) | Sunday | No | 20 | 2024-11-01 |
| Guy | Sunday | No | 11 | 2024-01-28 |
| Tobi | Sunday | Yes | 11 | 2024-01-28 |
| chibuzor | Tuesday | No | 0 | 2024-01-30 |
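One caveat with the pd.Series.mode aggregation above: when a member's counts are tied, mode returns every tied value (sorted), which is why a cell can contain an array rather than a single entry. A quick illustration on toy weekday values:

```python
import pandas as pd

s = pd.Series(['Sunday', 'Sunday', 'Tuesday'])
print(list(s.mode()))     # ['Sunday'] -- a single clear peak

tied = pd.Series(['Sunday', 'Tuesday'])
print(list(tied.mode()))  # ['Sunday', 'Tuesday'] -- a tie returns both
```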
It's worth pointing out that the top 5 members share a few traits. Most peak on Sundays, suggesting shared interaction patterns, but only Tobi is more active on weekends. Abolo and Anjola prefer late-night activity, while Guy and Tobi peak earlier. Chibuzor stands out with a Tuesday peak and unusual midnight activity.
A special shoutout to our most active members who also happen to be the most consistent—your dedication keeps the GDG Data and AI track thriving! Keep the momentum going!
Recommendations for Data members looking to boost interactions within the group
Encourage Evening and Weekend Discussions: Since most interactions happen between 8 PM and midnight and weekends show higher engagement per day, members should initiate conversations during these peak hours to maximize participation.
Engage in Popular Topics: Discussions on AI models, Google technologies, and DataCamp opportunities attracted significant engagement. Organizing structured conversations or Q&A sessions around these topics can sustain interest.
Organize More Interactive Sessions: Events like the linear regression study jam and the supervised vs. unsupervised learning challenge boosted participation. Hosting similar engaging activities can keep discussions lively and educational.
If you're interested in more insights from this data, here's the link to the dashboard created in Power BI, and if you'd like to carry out your own analysis, this is the link to the dataset on Kaggle. It contains two CSV files: one on the messages in the group and the other on the members of the group.
This is the end. If you found this project worth your while or you want to collaborate on any project, feel free to reach out to me on any of the following: